Abstract

The hybrid pigeonpea (Cajanus cajan) breeding technology based on cytoplasmic male sterility (CMS) is
currently unique among legumes and displays major potential for yield increase. CMS is defined as a
condition in which a plant is unable to produce functional pollen grains. The novel chimeric open reading
frames (ORFs) produced as a result of mitochondrial genome rearrangements are considered to be the main
cause of CMS. To identify these CMS-related ORFs in pigeonpea, we sequenced the mitochondrial genomes of
three C. cajan lines (the male-sterile line ICPA 2039, the maintainer line ICPB 2039, and the hybrid line
ICPH 2433) and of the wild relative (Cajanus cajanifolius ICPW 29). A single, circular-mapping molecule of
length 545.7 kb was assembled and annotated for the ICPA 2039 line. Sequence annotation predicted 51
genes, including 34 protein-coding and 17 RNA genes. Comparison of the mitochondrial genomes from
different Cajanus genotypes identified 31 ORFs, which differ between lines within which CMS is present or
absent. Among these chimeric ORFs, 13 were identified by comparison of the related male-sterile and
maintainer lines. These ORFs display features that are known to trigger CMS in other plant species and
represent the most promising candidates for CMS-related mitochondrial rearrangements in pigeonpea.
Key words: mitochondria; pigeonpea; next-generation sequencing; cytoplasmic male sterility; open reading
frames



1. Introduction 

Angiosperm mitochondrial genomes are unique in eukaryotes because of their high rates of
rearrangement, sequence duplication, ongoing gene loss, and frequent incorporation of foreign
DNA. 1-3 Land plant mitochondrial genomes vary in size from 105 kb 4 to >11 000 kb. 5 Hence, the
smallest land plant mitochondrial genome (Physcomitrella patens, 105 kb) is still ~11 times larger
than the human mitochondrial genome 6 (~16 kb). Several studies have reported the presence of
subgenomic circles in mitochondrial genomes that have arisen from recombination events. 7,8 While
such recombination events in plant mitochondria increase the complexity of their genome structures,
recombination has also been proposed to maintain genomic stability and may also provide a mechanism
to increase genetic variation in the absence of sexual reproduction. 9,10

Rearrangements in mitochondrial genomes are of considerable biotechnological interest as they can
cause cytoplasmic male sterility (CMS), which is a valuable tool for plant breeding programmes. Male
sterility is caused by the failure of a plant to produce functional pollen grains. 11 CMS is a
maternally inherited trait and is mainly controlled by the mitochondrial genome. CMS is often found
to be caused by chimeric mitochondrial open reading frames (ORFs) that are produced as a result of
mitochondrial genome rearrangements. 11-14 In many cases of CMS, male fertility can be restored by
the introduction of nuclear genes known as restorer-of-fertility (Rf) genes.

Although plant breeders have used CMS technology for producing F1 hybrids for enhancing crop
productivity in numerous cereal and vegetable crops, the development of F1 hybrids has not been
possible in legumes because of their high levels of self-pollination. In pigeonpea, however, a
moderate level of insect-mediated out-crossing exists that could be used to develop a stable CMS
system. 15 In 2005, Saxena et al. 15 derived a stable CMS system, ICPA 2039, from an interspecific
hybrid of cultivated pigeonpea (Cajanus cajan) and a wild relative (Cajanus cajanifolius)
(Supplementary Fig. S1). Previous CMS systems have been attempted in pigeonpea, 16,17 but have been
unsuccessful, mostly as a result of instability in the expression of male sterility and fertility
restoration. 15 The development and utilization of stable male-sterile lines from different
cytoplasmic backgrounds is a key factor in the diversification of pigeonpea hybrid parental lines.
Indeed, male-sterility systems in many crops do not allow the generation of completely male-sterile
progenies, drastically limiting the use of male-sterile lines in F1 hybrid seed production. 18 To
accelerate hybrid pigeonpea breeding for yield and quality, understanding the molecular basis of
male sterility is critically important. Specifically, the identification of CMS-associated genetic
polymorphisms is a key prerequisite for rational development of new and improved CMS systems for the
production of superior F1 hybrids. Next-generation sequencing (NGS) has provided opportunities to
gain genetic information in a much faster and more cost-effective manner. NGS of mitochondrial
genomes and analysis of genetic variations across the genomes of male-sterile, maintainer, and wild
relative species will facilitate the identification of genetic features related to male sterility.

This study reports the generation and analysis of mitochondrial genome sequences of four Cajanus
genotypes: the male-sterile line ICPA 2039, the maintainer line ICPB 2039, the hybrid line ICPH
2433, and the wild relative ICPW 29 (C. cajanifolius). A high-quality pigeonpea mitochondrial genome
assembly has been developed for ICPA 2039. This study provides the first comparative study of legume
mitochondrial genome sequences and identifies several rearrangements and no-coverage regions (large
regions >1000 bp with zero coverage), as well as chimeric ORFs associated with CMS in pigeonpea.

2. Materials and methods 

2.1. Plant material and mitochondrial DNA isolation

Cajanus lines ICPA 2039, ICPB 2039, ICPH 2433, and ICPW 29 were used as the source of mitochondrial
DNA (mtDNA). mtDNA was isolated from 2-week-old etiolated seedlings and was purified before
sequencing. 19

2.2. Sequencing and assembly 

Mitochondrial genomes of four pigeonpea lines were pyrosequenced with the Roche/454 FLX sequencing
platform following whole-genome amplification (WGA). The GenomePlex WGA kit from Sigma
(Sigma-Aldrich, St. Louis, USA) was used in this study. Twenty nanograms of DNA template were used
for WGA according to the manufacturer's protocol. In summary, the WGA process was divided into
fragmentation, library generation, and PCR amplification. The first two steps, fragmentation and
library generation (3 kb insert size), were carried out without interruption to avoid DNA
degradation. To amplify a larger amount of DNA, the GenomePlex reaction was allowed to proceed for
4 h. De novo genome assembly of the reference genome (ICPA 2039) was performed using Newbler,
Celera, and CLC bio software programs. All the usable reads were aligned onto the contig sequences,
and aligned paired-end sequences (PEs) were obtained. We then calculated the amount of shared PE
relationships between each pair of contigs, weighted the rates of consistent and conflicting PEs,
and then constructed the scaffolds step-by-step, beginning with the shortest insert-sized PEs and
proceeding to the long insert-sized PEs. Assemblies generated by the Newbler assembler were
considered most robust in terms of length of the scaffolds and genome coverage and were used for
further analysis. Gaps within the assembly were identified using contig-graph information. The Perl
script, parse_link.pl, was used to identify and close the gaps in silico 20
(http://www.cbcb.umd.edu/finishing/finishing-v1.tar.gz). Remaining gaps were filled by Sanger
sequencing. Graphs were generated for a preliminary view of the assembly (Fig. 1) in an effort to
check the order and orientation of mitochondrial scaffolds in the genome. Scaffolds that were not
connected to other scaffolds in the graph and showed low coverage were suspected to be part of the
chloroplast genome. BLASTN searches were performed for these scaffolds against the NCBI database to
check whether they were contamination from the chloroplast genome. Assembly graphs were used as a
guide to connect the scaffolds. Primers were designed from the ends of the scaffolds that showed
connections with other scaffolds in assembly graphs. The orientation of each scaffold within the
assembly was confirmed by Sanger sequencing.

Figure 1. A scheme showing the linking of scaffolds with the help of a graph in a preliminary view
of the assembly. Assembly graphs were used as a guide to connect the scaffolds. Each box represents
a scaffold, '||' represents the 3' end, and '|>' represents the 5' end of each scaffold. The number
on each scaffold represents the scaffold number and size of the scaffold. The thick black lines
indicate that the scaffolds are attached in correct orientation and dotted lines indicate that the
scaffolds are attached in reverse orientation in the assembly. Numbers on these lines represent the
sequence coverage. The orientation of each scaffold was confirmed through Sanger sequencing.



2.3. Gene prediction and annotation

Protein-coding and RNA genes were predicted by performing BLASTX and BLASTN searches, respectively,
against a database of protein-coding, tRNA, and rRNA genes compiled from all previously sequenced
seed plant mitochondrial genomes. 21 tRNAscan-SE 22 was used to corroborate the tRNA boundaries
identified by BLASTN. An e-value threshold of <1e-3 and a percent identity threshold of >70% were
initially used for filtering BLAST outputs. Gene boundaries were extended or trimmed to the
positions of the start and stop codons manually using Artemis 12.0. 23 Annotation data were written
to a Sequin-formatted table file with a set of Perl and CGI scripts.

2.4. Gene-order comparison

To identify colinearity between the pigeonpea mitochondrial genome and other angiosperms, we used
BLAT (Standalone BLAT v. 34) with an identity cut-off of >0.9 and coverage of >0.5 to compare the
gene order of our Cajanus genomes in a pairwise fashion with that of the following 11 angiosperms:
Vigna radiata, 24 Triticum aestivum, 25 Oryza sativa, 26 Zea mays, 27 Arabidopsis thaliana, 28 Beta
vulgaris, 29 Citrullus lanatus, 21 Cucurbita pepo, 21 Nicotiana tabacum, 30 Vitis vinifera, 31 and
Cucumis sativus. 32

2.5. Genome alignment and representation

The scaffolds of the three other mitochondrial genomes (ICPB 2039, ICPH 2433, and ICPW 29) were
aligned with the finished assembly of the ICPA 2039 genome using BLASTN. Circular genome
representations of the ICPA 2039 genome were generated using OGDRAW. 33 Three different maps were
generated to represent the alignment of the ICPA 2039 genome with the three other genomes. Figures
were scaled down to integrate all four maps into a single map.

2.6. Identification of rearrangements and no-coverage regions

Comparative assemblies of all three genomes were generated using GS Reference Mapper 2.5. Raw reads
of each mt genome were aligned to the ICPA 2039 genome assembly in order to detect any
sequence-level differences between them. Rearrangements with >60% frequency were considered for
further analysis. No-coverage regions were extracted using a custom Perl script, which checks the
coverage of every base in the assembly and groups the consecutive positions where coverage is very
low. Regions >1 kb and with approximately zero coverage were considered as no-coverage regions.



2.7. Chimeric ORFs 

Sequences for ORFs >100 codons in the vicinity of rearrangements or within no-coverage regions were
collected using Artemis 12.0. 23 ORFs coding for known mitochondrial genes were excluded from the
analysis. Further, these ORFs were searched with BLAST against the ICPA 2039 genome itself in order
to check whether they carry parts of other genes or ORFs. The first hit of each BLAST search was
set aside, as it corresponds to the original location of the ORF. All the other hits showing
identity >95% and sequence coverage of >16 bp were considered. Further, these ORFs were checked in
terms of their closeness to any predicted gene. Potential transmembrane helices were predicted with
TMHMM 2.0. 34 A score from 0 to 4 was assigned to each ORF: one point for the presence of parts of
other genes, one for proximity to any predicted gene, one for the presence of hydrophobic domains,
and one additional point for carrying parts of atp genes. ORFs showing a score of >3 were considered
as the potential chimeric ORFs.
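
The four-criterion scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (OrfEvidence, carries_gene_fragment, and so on) are hypothetical, the evidence flags are assumed to have been derived beforehand from the BLAST, gene-proximity, and TMHMM results, and the cutoff is left as a parameter rather than fixed in the code.

```python
from dataclasses import dataclass


@dataclass
class OrfEvidence:
    """Evidence flags for one candidate ORF (field names are illustrative)."""
    carries_gene_fragment: bool  # part of another gene or ORF found in the ORF
    near_predicted_gene: bool    # ORF lies close to a predicted gene
    has_hydrophobic_domain: bool # transmembrane helix predicted (e.g. by TMHMM)
    carries_atp_fragment: bool   # the embedded fragment comes from an atp gene


def chimeric_score(orf: OrfEvidence) -> int:
    """Add one point per satisfied criterion, giving a score from 0 to 4."""
    return sum([
        orf.carries_gene_fragment,
        orf.near_predicted_gene,
        orf.has_hydrophobic_domain,
        orf.carries_atp_fragment,
    ])


def passes_cutoff(orf: OrfEvidence, min_score: int) -> bool:
    """True when the ORF reaches the caller-supplied minimum score."""
    return chimeric_score(orf) >= min_score


# Hypothetical ORF: carries an atp fragment near a gene, no hydrophobic domain.
example_orf = OrfEvidence(True, True, False, True)
example_score = chimeric_score(example_orf)
```

Keeping the evidence flags separate from the score makes it easy to see which criterion an ORF failed, and the extra point for atp fragments is what gives those ORFs more weight.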
3. Results and discussion

Pigeonpea is an important legume crop for resource-poor smallholder farmers in marginal
environments. Unfortunately, the productivity of this legume staple crop has stagnated at ca.
750 kg/ha due to its exposure to biotic and abiotic stresses. Pigeonpea recently became the first
legume to have F1 hybrids released based on a CMS system. 35 The initial pigeonpea F1 hybrids (e.g.
ICPH 2671) showed >30% yield advantages over the best pure line varieties in the same geographic
regions. Such advances clearly indicate that pigeonpea F1 hybrid technology has the potential to
break the current yield plateau. For successful and sustainable pigeonpea hybrid production and
extension, the following are critical factors: (i) diversification of parental lines and CMS
sources, (ii) improvement of parental lines for tolerance to biotic and abiotic stresses, and (iii)
maintaining the purity of hybrid seeds. In this context, improvement of parental lines is underway
through conventional and molecular breeding approaches that are being accelerated by the
availability of a sequenced pigeonpea genome. 36 Simple sequence repeat (SSR) marker-based F1 hybrid
purity testing has also been initiated for ensuring the purity of hybrid seeds. 37,38 However, major
challenges remain in relation to the need for diversification of CMS sources in the pigeonpea
genepool. Although seven cytoplasmic sources are available (Cajanus sericeus, C. scarabaeoides,
C. volubilis, C. cajanifolius, C. cajan, C. lineatus, and C. platycarpus), only C. cajanifolius has
currently been commercially exploited. The other six sources have not been usable commercially,
because they do not express the CMS trait at an adequate level. To understand the factors
conditioning the efficiency differences between sources of CMS, it will first be necessary to
understand the molecular basis of CMS in pigeonpea.

3.1. Sequencing and assembly of four mitochondrial genomes of pigeonpea

We used Roche/454 FLX technology, targeted Sanger sequencing, and computational approaches to
produce one complete and three draft assemblies for the mitochondrial genomes of four Cajanus lines.
Northern blot-based screening and shotgun sequencing have been the conventional approaches used to
identify CMS-associated chimeric ORFs in different plant species, including maize, 39 sugar beet, 40
rice, 41,42 wheat, 25 and brassica. 43 Recently, Bentolila and Stefanov 44 successfully identified a
candidate for the wild abortive CMS-encoding gene by pyrosequencing of two rice mitochondrial
genomes using Roche/454 sequencing technology. However, Northern screening approaches used for the
same rice genomes failed to identify any potential CMS candidates in wild abortive rice CMS
lines. 44 Therefore, we used Roche/454 sequencing technology and a de novo assembly approach to
sequence the mitochondrial genome of Cajanus species. Due to the unknown genome architecture of the
pigeonpea mitochondrial genome, we did not rely entirely on in silico approaches, but also validated
proposed interscaffold connections by Sanger sequencing. To identify regions which varied in a
manner correlated with CMS, the four mitochondrial genomes were then compared using the ICPA 2039
assembly as a reference.

Roche/454 sequencing of four genomes from purified mtDNA generated totals of 38.8, 15.6, 37.1, and
23.8 Mb of paired-end data for ICPA 2039, ICPB 2039, ICPH 2433, and ICPW 29, respectively. The
sequencing reads were assembled into scaffolds using three different de novo assembly programs:
Newbler, CLC Bio, and Celera. Assemblies generated by the Newbler assembler were considered as the
best assemblies (Table 1), with scaffold N50 values of 169.6, 1.2, 169.9, and 108.2 kb for ICPA
2039, ICPB 2039, ICPH 2433, and ICPW 29, respectively. Additional sequence data were generated for
the comparatively under-sequenced genotypes ICPB 2039 (15.1 Mb) and ICPW 29 (23.2 Mb). The
sequencing data for these lines were reassembled, and the N50 of the two assemblies thus improved
from 1.2 to 12.8 kb for ICPB 2039 and from 108.2 to 159.2 kb for ICPW 29. In summary, mean scaffold
lengths of 22.4, 8.8, 6.3 and 37.3 kb were achieved for the pigeonpea mitochondrial genomes of lines
ICPA 2039, ICPB 2039, ICPH 2433, and ICPW 29, respectively. Analysis of sequence data for GC content
indicated a similar GC content distribution in all four genomes (Supplementary Fig. S2).
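
For readers unfamiliar with the N50 statistic used to compare the assemblies, the following minimal Python sketch shows the usual definition: the length of the scaffold at which the running total of scaffold lengths, taken from longest to shortest, first reaches half of the assembly size. The function name and the example lengths are illustrative assumptions and are not taken from the assemblers or data used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This scaffold carries the running total past 50% of the assembly.
            return length


# Hypothetical scaffold lengths, not the scaffolds reported in Table 1.
example_lengths = [170_000, 120_000, 90_000, 60_000, 40_000, 20_000]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, a few long scaffolds can give a high N50 even when many short scaffolds remain, which is why it is reported alongside scaffold counts and total bases.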

3.2. Finishing of reference mitochondrial genome ICPA 
2039 

The scaffolds of ICPA 2039 were further refined by removing contamination from nuclear or
chloroplast genomes, and by closing gap regions within these scaffolds using Sanger sequencing. A
preliminary view of the ICPA 2039 assembly was generated to check the connections between the
scaffolds and for the removal of contaminants (Fig. 1). Scaffolds from the mitochondrial genome
assembly were selected based on their coverage and links with the other scaffolds as shown in
Fig. 1. Selected scaffolds were subjected to BLASTN analysis against the NCBI database. Of 30
selected scaffolds of ICPA 2039, with a total length of 672 918 bp, seven were confirmed to be
mitochondrial in origin, representing 532 372 bp or 79% of the total sequence data (Table 1). Seven
scaffolds were homologous to cpDNA, representing 78 461 bp (11.6% of the total sequence). Eleven
further scaffolds matched nuclear DNA representing 46 228 bp (6.9% of the sequence). Two
Table 1. Generation of 454/FLX data and assembly statistics of ICPW 29, ICPA 2039, ICPB 2039, and ICPH 2433

Sequence reads
Genotype     Number of sequence reads (length) generated
ICPW 29      74 109 (23.8 Mb)/164 071 (53.4 Mb) c
ICPA 2039    121 170 (38.8 Mb)
ICPB 2039    51 723 (15.6 Mb)/117 163 (36.4 Mb) c
ICPH 2433    116 021 (37.1 Mb)

Newbler
Genotype     Number of scaffolds   Bases in scaffolds (bp)         N50 scaffold size (bp)   Number of large contigs a   Bases in large contigs (bp)
ICPW 29      156/18 c (4) d        723 814/672 137 c (575 487) e   108 193/159 243 c        202/425 c                   694 265/858 696 c
ICPA 2039    30 (7) d              672 918 (532 372) e             169 595                  345                         828 279
ICPB 2039    387/52 c (34) d       415 181/459 802 c (335 926) e   1153/12 823 c            430/669 c                   404 882/716 244 c
ICPH 2433    108 (9) d             681 810 (539 865) e             169 903                  184                         677 158

Celera
Genotype     Number of scaffolds   Bases in scaffolds (bp)   Number of big contigs b   Big contig length (bp)
ICPW 29      34                    618 692                   15                        501 689
ICPA 2039    84                    662 357                   20                        475 091
ICPB 2039    113                   199 602                   0                         0
ICPH 2433    44                    577 054                   18                        468 611

CLC Bio
Genotype     Number of contigs   Bases in contigs (bp)
ICPW 29      392                 951 760
ICPA 2039    1348                1 782 550
ICPB 2039    1032                529 436
ICPH 2433    564                 1 227 305

a Newbler assembler classifies contigs >500 bp as large contigs.
b Celera assembler classifies contigs >10 kb as big contigs.
c Data and assembly statistics of ICPW 29 and ICPB 2039 additional reads.
d Number of scaffolds from the mitochondrial genome.
e Bases in scaffolds from the mitochondrial genome.






Mt Genome Analysis in Cajanus spp. 



[Vol. 20, 



scaffolds representing 8268 bp (1.2% of the sequence data) matched the sequence of the plasmid DNA.
The remaining three scaffolds representing 8309 bp (1.2%) of sequence data did not show any match in
the NCBI database and may represent sequences unique to pigeonpea. The seven scaffolds that matched
other plant mtDNA were targeted for further analysis. A total of 38 gaps (26 830 bp) were observed
in the seven mitochondrial scaffolds, represented by Ns in the assemblies. Using the parse_link.pl
script (http://www.cbcb.umd.edu/finishing/finishing-v1.tar.gz) from the finishing toolbox (see
Materials and methods), 47 contigs that had not previously been assembled into scaffolds were
introduced to fill the 38 gaps inside the seven scaffolds. Two gaps (of 65 bp each), which the
script was unable to fill in silico, were closed using Sanger sequencing technology. Subsequently,
assembly graphs were used as a guide to connect the scaffolds. To confirm the order and orientation
of each scaffold within the assembly, primer pairs were designed from the ends of each scaffold
based on their connections with other scaffolds in assembly graphs. A set of 24 primer pairs
(Supplementary Table S1) was used to generate amplicons, and sequence data were generated for these
using Sanger sequencing technology. In this way, a high-quality, circular-mapping mitochondrial
genome of 545 742 bp in total length, with ~23-fold coverage, was assembled for ICPA 2039. This
master circular molecule contains a large recombinationally active repeat of 4951 bp, extending from
position 531 745 to position 536 696. Recombinationally active large repeats are a very common
feature of plant mitochondrial genomes. 45,46
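
As a quick arithmetic check of the scaffold classification reported above, the Python sketch below recomputes each category's share of the 672 918 bp of selected scaffolds. The scaffold counts and base totals are the figures quoted in the text; the dictionary layout and names are illustrative assumptions.

```python
TOTAL_BP = 672_918  # total length of the 30 selected ICPA 2039 scaffolds

# (number of scaffolds, total bp) per BLASTN category, as quoted in the text
categories = {
    "mitochondrial": (7, 532_372),
    "chloroplast": (7, 78_461),
    "nuclear": (11, 46_228),
    "plasmid": (2, 8_268),
    "no database match": (3, 8_309),
}


def percent_of_total(bp, total=TOTAL_BP):
    """Share of the selected scaffold length, in percent."""
    return 100 * bp / total


scaffold_count = sum(count for count, _ in categories.values())
percentages = {name: percent_of_total(bp) for name, (_, bp) in categories.items()}
```

The five categories account for all 30 scaffolds, and the recomputed shares agree with the quoted percentages to within about 0.1 percentage point.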

3.3. Gene content 

Within the mitochondrial genome of ICPA 2039, we identified 34 protein-coding, 14 tRNA, and 3 rRNA
genes (Supplementary Table S2), for a total of 29 346 bp of protein exons; 31 018 bp of intronic
sequence; 5255 bp of rRNA genes; and 1477 bp of tRNA genes (Table 2). We did not find a copy of
cox2, confirming proposals that this gene has been lost in the legume



Table 2. Genome coverage by coding features in ICPA 2039 mitochondrial genome assembly

Class        Feature             ICPA 2039 a (%)
Total size                       545 742 bp
Coding       Protein exons       29 346 bp (5.4)
             Introns             31 018 bp (5.6)
             rRNA                5255 bp (0.9)
             tRNA                1477 bp (0.2)
Non-coding   Mitochondria-like   220 747 bp (40.5)
             Nuclear-like        40 330 bp (7.4)

a Figure in parentheses represents the percentage of total size.



lineages. 24,47 In contrast, we found ICPA 2039 to contain two identical copies of the cox3 gene.
Multiple copies of tRNAs for cysteine, lysine, and methionine were observed to be present in the
mitochondrial genome of ICPA 2039. The tRNA genes for methionine are also highly similar to those of
the plastid and are likely derived from cpDNA, as is that for tryptophan (Supplementary Table S2).

To annotate mtDNA-encoded genes in the other sequenced lines, their scaffolds were compared with
ICPA 2039. Six scaffolds derived from the ICPH 2433 hybrid contained 32 protein-coding genes and 12
tRNA genes between them (Supplementary Table S3). Four mitochondrial scaffolds of ICPW 29 covered 33
protein-coding and 14 tRNA genes (Supplementary Table S4). As in ICPA 2039, multiple copies of
cysteine, lysine, and methionine tRNA genes were present in the scaffolds of ICPH 2433 and ICPW 29.
Due to low sequence coverage and small scaffold size, the 17 mitochondrial scaffolds of the ICPB
2039 line only covered 15 protein-coding and 11 tRNA genes between them (Supplementary Table S5).

3.4. Structural features of the pigeonpea mitochondrial 
genome compared with other plant species 
The assembled Cajanus mitochondrial genome was compared with the mitochondrial genomes of 11 other
land plant species. The species for comparison include one legume (V. radiata 24), three cereals
(T. aestivum, 25 O. sativa, 26 Z. mays 27), and seven other eudicots, including A. thaliana, 28
B. vulgaris, 29 Citrullus lanatus, 21 Cucurbita pepo, 21 N. tabacum, 30 Vitis vinifera, 31 and
Cucumis sativus. 32 In terms of mitochondrial genome size, the pigeonpea mitochondrial genome
(545 742 bp) is substantially larger than the mitochondrial genome of the closest sequenced legume
species, V. radiata 24 (401 262 bp). The size of the pigeonpea mitochondrial genome was found to be
comparable with the mitochondrial genomes of cereal species, e.g. O. sativa (490 kb) and Z. mays
(569 kb), and greater than the median angiosperm mitochondrial genome size (473 kb). Mitochondrial
genome size can not only vary between different species, but can also show variations between
different lines of the same species. For instance, the genome size of five sequenced maize
mitochondrial genomes is known to vary from 535 825 to 739 719 bp. 39 Plant mitochondrial genomes
are rich in non-coding regions, and these regions are highly variable. In our analysis, 12.29% of
the pigeonpea mitochondrial genome was covered by coding regions, which is lower than but comparable
to V. radiata 24 (16.88%), Brassica napus 43 (17.34%), and Citrullus lanatus 21 (18.8%). However,
the coding fraction of the pigeonpea mitochondrial genome was found to be greater than that of
Cucurbita pepo, in which only 6.9% 21 of the mitochondrial genome is covered by coding regions.
Cucurbita pepo has a particularly large mitochondrial genome (982 833 bp), due to the insertions of
chloroplast (>113 kb) and short repeated sequences (>370 kb) in the mitochondrial sequences. 21 The
available plant mitochondrial genome sequences therefore suggest that gene composition does not
depend on total genome size. Gene number can be conserved despite changes in size derived from the
insertion of repetitive sequences, or of sequences transferred from the chloroplast and nucleus into
intergenic regions. In addition, a large number of rearrangements of genes and ORFs are common
features in plant mitochondrial genomes.
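
The 12.29% coding fraction quoted above can be reproduced from the feature sizes in Table 2. The short Python sketch below assumes that the figure counts protein exons, introns, rRNA genes, and tRNA genes together; that reading is an inference from the table rather than something stated explicitly in the text, and the variable names are illustrative.

```python
GENOME_BP = 545_742  # finished ICPA 2039 mitochondrial genome

# Sizes of the "Coding" class features from Table 2, in bp
coding_bp = {
    "protein exons": 29_346,
    "introns": 31_018,
    "rRNA": 5_255,
    "tRNA": 1_477,
}

total_coding_bp = sum(coding_bp.values())
coding_percent = 100 * total_coding_bp / GENOME_BP
```

Under this assumption the four features sum to 67 096 bp, which corresponds to 12.29% of the genome when rounded to two decimal places.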

3.5. Comparison of gene order with other plant species 
We also compared gene maps of protein-coding and rRNA genes encoded by the pigeonpea mitochondrial
genome (ICPA 2039) with those of the other 11 sequenced plant species. Unsurprisingly, the highest
level of synteny for the mitochondrial genomes was observed between C. cajan and the related legume
V. radiata. For instance, one four-gene cluster (cox3-nad4L-atp4-rps10ab), two three-gene clusters
(nad6-nad1-ccmB and rpl5-rps14-cob), and three two-gene clusters (rps3ab-rpl16, rps12-nad3, and
ccmC-ccmFn) were syntenic between these two species (Fig. 2). On







Figure 2. Correlation of gene order between the mitochondrial gene 
maps of C. cajan and V. radiata. Left-hand side is represented by 
genes identified in C. cajan and top side is represented by genes 
of V. radiata. Shaded blocks in the image represent the 
correlation of gene orders. 



the other hand, only two two-gene clusters (nad3-rps12 and ccmC-ccmFn) showed synteny between the
mitochondrial genome of C. cajan and those of the three cereal species analysed (Supplementary Figs
S3-S5). Two two-gene clusters (nad2ab-atp1 and rps3ab-rpl16) of the C. cajan mitochondrial genome
showed synteny when compared with wheat and maize (Supplementary Figs S3 and S4). The most highly
conserved gene cluster was the two-gene cluster of rps3ab-rpl16, which was syntenic across 7 of the
11 species (Fig. 2 and Supplementary Figs S3-S12).
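
The idea of a conserved two-gene cluster can be illustrated with a small Python sketch that finds gene pairs that are neighbours in both of two gene orders. This is a deliberate simplification and not the BLAT-based comparison used in this study: gene orientation and the circularity of the genomes are ignored, and the gene orders below are hypothetical lists invented for the example.

```python
def adjacent_pairs(gene_order):
    """Return the set of unordered neighbouring gene pairs in a gene order."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def shared_adjacencies(order_a, order_b):
    """Gene pairs that are neighbours in both genomes, ignoring orientation."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Hypothetical gene orders for illustration only; these are not the real maps.
genome_a = ["cox3", "nad4L", "atp4", "rps3", "rpl16", "cob"]
genome_b = ["rpl16", "rps3", "nad6", "cox3", "nad4L", "atp4"]
shared = shared_adjacencies(genome_a, genome_b)
```

Longer clusters appear as chains of shared pairs: in the toy data, the pairs cox3-nad4L and nad4L-atp4 together correspond to a conserved three-gene block.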



3.6. Comparison of mitochondrial genome sequences among Cajanus lines

Comparisons of gene order typically highlight the high rate of mitochondrial genome rearrangements
between different plant species. Our analysis indicates that the mitochondrial genome of pigeonpea
shares only six gene clusters with the closely related sequenced species V. radiata, 24 which
between them cover only 16 genes. Fewer gene clusters were observed to be conserved in comparison
with more distantly related species. Due to the dynamic nature of plant mitochondrial genomes,
extensive structural variations can also be expected to occur in different lines of a given
species. 46 Comparison of mitochondrial genomes from five different lines of maize revealed 16
rearrangements, even between two fertile cytotypes. 39 To understand the patterns of genetic
variation associated with CMS in pigeonpea, and its effects on maternal inheritance, we first
aligned the scaffolds of the mitochondrial genomes of the three Cajanus lines (i.e. ICPB 2039, ICPH
2433, and ICPW 29) with that of the male-sterile line ICPA 2039 using BLASTN. This demonstrated that
the genomes of ICPA 2039 and ICPB 2039 are highly diverged from each other. Conversely, the
mitochondrial genome of the hybrid ICPH 2433 (produced from the ICPA 2039 × ICPR 2433 cross) showed
the highest level of synteny with the ICPA 2039 line, followed by ICPW 29 (Fig. 3), supporting the
model that the CMS trait is maternally inherited in pigeonpea hybrids.

The sequence-level divergence between the mitochondrial genomes of the ICPA 2039 and ICPB 2039 lines
identified by BLASTN was further validated by use of the GS Reference Mapper 2.5, which is commonly
used for mapping 454 reads to a reference assembly. When raw reads of ICPB 2039, ICPH 2433, and ICPW
29 were mapped onto the assembly of ICPA 2039, the largest number of rearrangements was observed
between the fertile and sterile lines of the ICPA 2039 system (Supplementary Fig. S1). While
building the comparative assemblies, we identified the no-coverage regions along with the
rearrangements in order to reduce the effects of sequencing artifacts in further comparison.

Figure 3 legend (gene classes): complex I (NADH dehydrogenase); complex II (succinate dehydrogenase); complex III (ubiquinol cytochrome c reductase); complex IV (cytochrome c oxidase); ATP synthase; ribosomal proteins (SSU); ribosomal proteins (LSU); other genes; transfer RNAs; ribosomal RNAs.

Figure 3. Alignments of Cajanus mitochondrial genomes. The outer circle represents the finalized mitochondrial genome assembly and gene
annotation of male-sterile line ICPA 2039. Second, third, and fourth circles from the outer circle represent the scaffolds of ICPH 2433, ICPW
29, and ICPB 2039 mapped on the ICPA 2039 assembly. Numbers on each circle represent the scaffolds of each line. Hn represents the scaffolds
for ICPH 2433, Wn represents the scaffolds for ICPW 29, and Bn represents the scaffolds for ICPB 2039, where n is the scaffold number.



Twenty-two rearrangements and 17 no-coverage regions were observed in ICPB 2039 compared with ICPA
2039. Using the same criteria, 9 rearrangements and 12 no-coverage regions were observed in the
mitochondrial genome of ICPW 29 when compared with that of ICPA 2039. These nine rearrangements
could be the result of differences that occurred during the maintenance of the CMS cytoplasm. We do
not expect these differences to be associated with the CMS trait as the wild relative line is the
maternal parent of the sterile line. The mitochondrial genome of ICPH 2433 was found to be closest
in sequence to that of ICPA 2039, with no differences observed between these lines (Supplementary
Fig. S13 and Supplementary Tables S6 and S7).
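
The no-coverage criterion from Materials and methods (runs of more than 1 kb with approximately zero read coverage) can be sketched as follows. The study used a custom Perl script that is not reproduced here; this Python version is an independent illustration, and the function name, the max_depth parameter used to model "approximately zero", and the toy coverage vector are all assumptions.

```python
def no_coverage_regions(coverage, max_depth=0, min_length=1000):
    """Return 1-based inclusive (start, end) intervals of uncovered sequence.

    coverage   : per-base read depth along the reference, one value per base
    max_depth  : a base counts as uncovered when its depth is <= max_depth
    min_length : only runs longer than this many bases are reported
    """
    regions = []
    run_start = None  # 0-based index where the current uncovered run began
    for index, depth in enumerate(coverage):
        if depth <= max_depth:
            if run_start is None:
                run_start = index
        elif run_start is not None:
            if index - run_start > min_length:
                regions.append((run_start + 1, index))
            run_start = None
    # Close a run that extends to the end of the reference.
    if run_start is not None and len(coverage) - run_start > min_length:
        regions.append((run_start + 1, len(coverage)))
    return regions


# Toy depth profile: a 1500 bp gap is reported, a 200 bp gap is not.
toy_coverage = [5] * 10 + [0] * 1500 + [7] * 10 + [0] * 200 + [3] * 5
toy_regions = no_coverage_regions(toy_coverage)
```

Raising max_depth slightly would tolerate a handful of stray reads inside an otherwise uncovered region, which is one way to read "approximately zero coverage".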

3 . 7. Candidate CMS-associated chimeric ORFs 

The CMS trait is often associated with chimeric ORFs 
that are the products of mitochondrial genome 
rearrangement, 48 which can cause pollen abortion. In 
some crops, chimeric ORFs are found in sterile lines 
(e.g. A-line), but are absent from fertile lines (e.g. B- and 
R-lines). 13,14,39 A number of studies have confirmed 
the role of chimeric ORFs in male sterility by disrupting 
the function of these ORFs through the insertion or 
deletion of a few base pairs (reviewed by Hanson and 
Bentolila). 48 Many of the chimeric genes associated 
with CMS in other crops are found in the proximity of 
protein-coding genes and include regions encoding 
transmembrane domains and other parts of known 
mitochondrial genes. 48-52 Hence, we set out to identify 
the chimeric ORFs that most closely match these 
criteria by scanning around positions that have 
undergone rearrangements or that are absent from 
particular pigeonpea lines. Only ORFs >300 bp were 
considered, and they were ranked as chimeric based on 
the presence of parts of other genes, proximity to 
known mitochondrial genes, and the presence of 
hydrophobic domains. A score ranging from 0 to 4 was 
assigned to each ORF (see Chimeric ORFs section of 
Materials and methods). As abnormal ATP synthase 
genes are sometimes associated with CMS, 49-52 ORFs 
containing parts of atp genes were more heavily weighted. 
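The filtering and ranking criteria described above can be illustrated with a minimal sketch. The exact scoring scheme is given in the Materials and methods and is not reproduced here, so the point allocation below (one point per criterion, plus one extra point for a fragment of an atp gene, giving a range of 0 to 4) is purely an assumption for illustration. The names CandidateOrf and score_orf are hypothetical.

```python
from dataclasses import dataclass

MIN_ORF_LENGTH = 300  # bp; only ORFs longer than this are considered (from the text)


@dataclass
class CandidateOrf:
    """One ORF found near a rearrangement or no-coverage region (hypothetical record)."""
    start: int
    stop: int
    carries_gene_fragment: bool   # contains part of a known mitochondrial gene
    near_known_gene: bool         # lies in the proximity of a known gene
    has_hydrophobic_domain: bool  # predicted hydrophobic/transmembrane region
    fragment_gene: str = ""       # name of the gene the fragment comes from, if any


def score_orf(orf):
    """Return an illustrative 0-4 score, or None if the ORF is too short.

    ASSUMPTION: one point per criterion and one extra point for atp gene
    fragments. The published weighting may differ.
    """
    if orf.stop - orf.start <= MIN_ORF_LENGTH:
        return None
    score = 0
    score += int(orf.carries_gene_fragment)
    score += int(orf.near_known_gene)
    score += int(orf.has_hydrophobic_domain)
    # Heavier weighting for ORFs containing parts of atp genes (assumed: +1).
    if orf.carries_gene_fragment and orf.fragment_gene.lower().startswith("atp"):
        score += 1
    return score


if __name__ == "__main__":
    # Coordinates taken from Table 3; the boolean flags are read off the table
    # columns (subject feature, nearest gene, number of transmembrane helices).
    atp9_orf = CandidateOrf(276037, 276405, True, False, True, "atp9")
    cox3_orf = CandidateOrf(260331, 260702, False, True, True)
    print(score_orf(atp9_orf), score_orf(cox3_orf))
```

Under this assumed weighting, an ORF that carries an atp fragment and a hydrophobic domain outranks one that merely lies next to a known gene, which mirrors the stated emphasis on abnormal ATP synthase genes.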

Our study identifies 13 such potential CMS candidates 
in the pigeonpea male-sterile line ICPA 2039 
(Table 3). Of these 13 potential candidates, five 
carry parts of other mitochondrial genes and eight 
were observed to be in the proximity of other 
mitochondrial genes. Liu et al. 25 have hypothesized that a 
wheat K-type CMS line, Ks3, contains a chimeric ORF 
encoding partial subunits of several components of 
the respiratory chain complex, including atp4, atp6, 
nad3, nad6, nad9, cox1, and cox3. These altered 
proteins may interfere with the normal function of 
respiratory chain reactions and cause pollen 
development to abort. Intriguingly, five of the candidates 
identified in our study incorporate parts of some of these 
genes, including atp1, nad4, rps4, nad5, and atp9. 
This raises the possibility that pigeonpea CMS involves 
a mechanism similar to that proposed for wheat. 
Transmembrane domains are another prominent feature 
associated with CMS ORFs. 53 Many of our candidate ORFs 
carry regions predicted to encode transmembrane 
domains, and a number of the encoded proteins 
have been shown to be associated with the inner 
mitochondrial membrane. 48 Recently, Bentolila and 
Stefanov 44 have identified a candidate for the wild 
abortive CMS in rice that has arisen via rearrangement, 
is chimeric in structure, possesses predicted 
transmembrane domains, and carries the promoter 
of a mitochondrial gene. Of our 13 candidates, 
11 are predicted to carry such transmembrane 
domains. These novel ORFs may trigger CMS by 
damaging mitochondrial membrane structure such that 
the resulting permeability change affects 
mitochondrial function. 48,54 Previous histological studies of 
CMS in pigeonpea have revealed that meiosis in both 
male-fertile and male-sterile plants proceeds normally 
up to the tetrad stage, and that during this period 
the tapetum remains intact. Male sterility becomes 
manifest after this stage, with tetrads in male-sterile 
plants remaining enclosed within a persistent tetrad wall 
and the pollen grains subsequently undergoing 
vacuolation and abortion. 55 Therefore, identifying the 
ORFs that are causative for CMS in pigeonpea will 
require the transcription and translation patterns of 
our unique ORF candidates to be determined in young 
to mature buds and in floral parts, including the 
pollen mother cell, tetrad, and pollen grains. The roles 
of transmembrane domains and respiration in the 
mitochondrial genome of ICPA 2039 will also need 
to be assessed. Future structural and functional 
studies will allow the exact mitochondrial genomic 
segments responsible for male sterility in pigeonpea 
to be defined. 
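As an illustration of how a hydrophobic, potentially membrane-spanning segment can be flagged in a translated ORF, the sketch below applies a Kyte-Doolittle hydropathy sliding window. This is a substitute technique chosen for simplicity and is not necessarily the predictor used to obtain the helix counts in Table 3. The 19-residue window and the 1.6 threshold follow the classic Kyte-Doolittle convention, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(protein, window=19):
    """Mean hydropathy of every full window along the protein sequence.

    Residues missing from the table (e.g. X or a stop symbol) count as 0.0.
    Returns an empty list if the protein is shorter than the window.
    """
    protein = protein.upper()
    return [
        sum(KD.get(aa, 0.0) for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


def has_hydrophobic_segment(protein, window=19, threshold=1.6):
    """True if any window reaches the threshold (a rough transmembrane hint)."""
    return any(s >= threshold for s in hydropathy_windows(protein, window))


if __name__ == "__main__":
    # Toy sequences, not taken from the pigeonpea data.
    membrane_like = "MKR" + "LIVALFLIVAGLLIVAFLL" + "DEK"
    soluble_like = "MKRDEDSNQKRDEDSNQKRDEDSNQ"
    print(has_hydrophobic_segment(membrane_like))
    print(has_hydrophobic_segment(soluble_like))
```

A window-average screen of this kind only suggests candidate segments; a dedicated transmembrane topology predictor would be needed to estimate helix numbers such as those reported in Table 3.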



Table 3. Potential chimeric ORFs identified from the no-coverage and rearrangement regions between the ICPA 2039 and ICPB 2039 lines 

ORF start  ORF stop  ORF length  Nearest gene  Subject start  Subject stop  Chimera length  Identity  Subject features  No. of transmembrane helices
260331     260702    371         cox3          260342         260702        361             98        ORF               1
164867     165424    557         nad7          165424         165353        72              100       ORF               1
420342     420902    560         -             233322         232862        461             96        atp1              0
534744     535115    371         -             352424         352403        361             98        nad4              1
264435     265745    1310        -             60093          59813         281             100       rps4              2
165464     165853    389         nad7          265742         265981        241             97        ORF               3
164867     165424    557         nad7          165424         165353        72              100       ORF               1
276037     276405    368         -             468809         468752        58              98        atp9              3
396025     396876    851         mttB          264909         265396        488             97        ORF               2
396285     396641    356         mttB          265144         265396        253             99        ORF               1
264069     264434    365         -             44935          45092         158             100       nad5              1
165633     166088    455         nad7          265981         265886        96              100       ORF               0
8534       8842      308         ccmFc         476033         476052        20              100       -                 1